ABSTRACT

We present a new technique to estimate the level of contamination between photometric red- 
shift bins. If the true angular cross-correlation between redshift bins can be safely assumed 
to be zero, any measured cross-correlation is a result of contamination between the bins. We 
present the theory for an arbitrary number of redshift bins, and discuss in detail the case of two 
and three bins which can be easily solved analytically. We use mock catalogues constructed 
from the Millennium Simulation to test the method, showing that artificial contamination can 
be successfully recovered with our method. We find that degeneracies in the parameter space 
prohibit us from determining a unique solution for the contamination, though constraints are 
made which can be improved with larger data sets. We then apply the method to
an observational galaxy survey: the deep component of the Canada France Hawaii
Telescope Legacy Survey. We estimate the level of contamination between
photometric redshift bins and demonstrate our ability to reconstruct both the
true redshift distribution and the true average redshift of galaxies in each
photometric bin.

Key words: galaxies: distances and redshifts - galaxies: photometry -
techniques: photometric - methods: analytical - large-scale structure of
Universe

1 INTRODUCTION

The time-intensive nature of spectroscopy and the immense number 
of galaxies in current and future surveys have secured the place of 
photometric redshifts among the most valuable astronomical tools. 
Nearly all cosmological probes are sensitive to distance, making 
redshifts particularly important to cosmology. While spectroscopy 
is the most accurate and precise way to measure redshift, it is far too 
time-intensive to apply to current or future surveys which require 
redshift estimates for millions of galaxies. Hence, photometric red- 
shifts and a thorough understanding of their uncertainties are vital 
to the study of large galaxy surveys. 

The observed colours of a galaxy are often unable to uniquely 
specify a galaxy type and redshift. Degeneracies in fitting galaxy 
colours can result in significant misidentifications causing contam- 
ination between all redshifts. In this paper we investigate contami- 
nation between photometric redshift bins and present a method for 
estimating the contamination using the angular correlation function.
This method relies only on the photometric redshifts; it does not
require a spectroscopic sample. However, it is likely that the
photometric redshifts will have been calibrated with a spectroscopic
training sample.

Quantifying photometric redshift contamination is essential to
exploit the full potential of future photometric surveys, such as the
Large Synoptic Survey Telescope (LSST) or the Supernova Acceleration
Probe (SNAP), which aim to constrain dark energy. Weak lensing and
baryon acoustic oscillations (BAO) are both sensitive to the mean
redshift of a bin, requiring an unbiased measurement on the order of
0.001-0.005 in redshift so that the constraints on dark energy are not
significantly degraded (Huterer et al., 2004).

Several studies have investigated the use of the angular correlation
function in determining the true redshift distribution of galaxies
binned by photometric redshift (Schneider et al., 2006; Newman, 2008;
Zhang et al., 2009). These works have adopted LSST-like survey
parameters and focused on the Fisher information matrix as a means of
forecasting constraints. Schneider et al. (2006) show that the angular
correlation function in the linear regime can be used to measure the
mean redshift to an accuracy of 0.01, noting that there is further
constraining power at smaller scales. The method investigated by
Newman (2008) relies on an overlapping spectroscopic sample so that
the angular cross-correlation between it and the photometric redshift
bins can be exploited. They find that the desired accuracy on the mean
redshift can be reached provided a spectroscopic sample of 25,000
galaxies per unit redshift is available. Zhang et al. (2009)
investigate a self-calibration method, where the mean redshift in a
bin is estimated from angular cross-correlations between photometric
redshift bins, very similar to the one presented in this work and in
Erben et al. (2009). They demonstrate that self-calibration can reach
the required accuracy on the mean redshift if additional information
from weak lensing shear is used to help break parameter degeneracies.

We focus here on the practical application of measuring photometric
contamination in both simulated and real data. This is of interest not
only for meeting the stringent requirements of future surveys as
mentioned above, but also for other applications such as: a diagnostic
tool for photometric redshifts; determining background samples for
cluster lensing; and estimating the true redshift distribution for 2-D
cosmic shear measurements. The details of our method for a strict
two-bin analysis of the CFHTLS-Archive-Research Survey (CARS) are
presented in Erben et al. (2009), where we demonstrate that the
contamination present in the bright spectroscopic training sample is
consistent with the contamination seen in the much deeper photometric
redshift sample. This addresses a central concern when calibrating
photometric redshifts with spectroscopic redshifts.

This paper is organised as follows. The angular correlation function
and the estimators we use are presented in §2. The details of the
analytic method for estimating redshift bin contamination are
discussed in §3, which first addresses the general problem of
contamination between an arbitrary number of redshift bins before
focusing on the two-bin case and its extension to multiple redshift
bins. The analytic model is tested using mock observational catalogues
in §4, where we show that contamination between redshift bins can be
accurately determined with our model. In §5 we measure contamination
in a real galaxy survey, demonstrating the ability of the method to
constrain the true (uncontaminated) redshift distribution and measure
the average redshift of each photometric redshift bin. Concluding
remarks, including limitations of the method and ideas for future
work, are presented in §6. Details of the maximum likelihood method,
including a detailed discussion of the covariance matrix, are left to
appendix A. The contamination model for three redshift bins is
discussed in appendix B.



2 ANGULAR CORRELATION FUNCTION AND
ESTIMATORS

The two-point angular correlation function describes the amount of
clustering in a distribution of galaxies relative to what would be
expected from a random distribution. The angular correlation function
$\omega$ can be interpreted as the excess probability of finding an
object in the solid angle $d\Omega$ at an angular distance $\theta$
from another object. The total probability is given by

$$ dP = N\,[1 + \omega(\theta)]\,d\Omega, \tag{1} $$

where $N$ is the density of objects per unit solid angle (Peebles,
1980). Note that $N\,d\Omega$ is the Poisson probability that the
solid angle element is occupied by a galaxy.

Many estimators have been devised for the correlation function.
Kerscher et al. (2000) present a comparison of the most widely used
estimators. They mainly differ in the handling of edge effects, and
for arbitrarily large number densities they all converge. In §3 we
employ the simplest estimator,

$$ \omega_{ij}(\theta) = \frac{(D_iD_j)_\theta}{(RR)_\theta}\,\frac{N_R N_R}{N_i N_j} - 1, \tag{2} $$

where $(D_iD_j)_\theta$ is the number of pairs separated by an angular
distance $\theta$ between data sets $i$ and $j$, $(RR)_\theta$ is the
number of pairs separated by a distance $\theta$ for a random set of
points, $N_R$ is the number of points in the random sample and $N_i$
($N_j$) is the number of points in data sample $i$ ($j$). In this work
we consider $i$ and $j$ to be non-overlapping redshift bins. The
auto-correlation is described by the case $i=j$, and the
cross-correlation by the case $i \neq j$.

A more robust estimator (Landy & Szalay, 1993) is

$$ \omega_{ij}(\theta) = \frac{(D_iD_j)_\theta}{(RR)_\theta}\,\frac{N_R N_R}{N_i N_j} - \frac{(D_iR)_\theta}{(RR)_\theta}\,\frac{N_R}{N_i} - \frac{(D_jR)_\theta}{(RR)_\theta}\,\frac{N_R}{N_j} + 1. \tag{3} $$

This is used when measuring the correlation function from data, either
the Millennium Simulation in §4 or the CFHTLS-Deep fields in §5.

Galaxies cluster in over-dense regions, leading to an excess number of
pairs when compared to a random distribution of points. On small
scales, the angular auto-correlation function of galaxies is positive
for all redshifts, though the shape and amplitude vary as a function
of redshift due to the evolution of structure formation. The angular
cross-correlation between two distant redshift bins is zero since
galaxies that are physically separated by large distances do not
cluster with one another.

When considering galaxies binned in redshift, neighbouring bins may
have a significant cross-correlation, especially if the bin width is
narrow, since a significant number of galaxies in each bin could be
clustered with each other. In the case of photometric redshifts the
typical redshift error ($\Delta z \sim 0.05$) will result in a large
cross-correlation if the width of the bins is not much larger than
this error.

Photometric redshift bins that are not neighbouring should not be
physically clustered with one another and their cross-correlation
should be zero. Deviations from zero indicate that the two bins
contain galaxies that are physically clustered. These galaxies may
result from contamination between the two bins or from the mutual
contamination of both bins from other redshifts.

Weak lensing magnification can cause galaxies at high redshift to
cluster near lower redshift galaxies. This effect can be calculated
and accounted for; we discuss this in §6.

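
As an illustration of how such estimators are evaluated from binned pair counts, the following sketch computes two common forms. It assumes the simplest estimator is the normalised ratio of data-data to random-random pairs minus one, and uses the Landy & Szalay (1993) form for the second. The function names, argument names and pair counts are hypothetical, and the pair counts are assumed to have been measured already.

```python
import numpy as np

def natural_estimator(dd, rr, n_i, n_j, n_r):
    """Simplest estimator: normalised (D_i D_j)/(RR) pair ratio minus one."""
    dd = np.asarray(dd, dtype=float)
    rr = np.asarray(rr, dtype=float)
    return dd / rr * (n_r * n_r) / (n_i * n_j) - 1.0

def landy_szalay(dd, d_i_r, d_j_r, rr, n_i, n_j, n_r):
    """Landy & Szalay (1993) style estimator for data samples i and j."""
    dd, d_i_r, d_j_r, rr = (np.asarray(x, dtype=float)
                            for x in (dd, d_i_r, d_j_r, rr))
    dd_term = dd / rr * (n_r * n_r) / (n_i * n_j)
    dir_term = d_i_r / rr * n_r / n_i
    djr_term = d_j_r / rr * n_r / n_j
    return dd_term - dir_term - djr_term + 1.0

# Hypothetical pair counts in three angular bins (illustrative numbers only).
n_i, n_j, n_r = 1000, 800, 5000
rr = np.array([2.0e4, 8.0e4, 3.2e5])
# Unclustered data: each count equals the random expectation scaled by sample size.
dd = rr * (n_i * n_j) / (n_r * n_r)
d_i_r = rr * n_i / n_r
d_j_r = rr * n_j / n_r
w_natural = natural_estimator(dd, rr, n_i, n_j, n_r)
w_ls = landy_szalay(dd, d_i_r, d_j_r, rr, n_i, n_j, n_r)
```

For unclustered counts both estimators vanish, and when the data-random counts equal their random expectation the two forms coincide.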
3 ANALYTIC DEVELOPMENT OF THE
CONTAMINATION MODEL

The goal of the contamination model is to measure the level of
contamination between all photometric redshift bins by measuring the
angular cross-correlation. We have presented the details of the
two-bin model in Erben et al. (2009) and applied it to the
CFHTLS-Archive-Research Survey (CARS). Here we develop a fully
consistent multi-bin approach in §3.1 before revisiting the two-bin
case and extending it to a global pairwise analysis in §3.2.



3.1 Multi-Bin Analysis 

We first address the case of an arbitrary number of redshift bins,
where each bin is potentially contaminating every other bin. The
number of galaxies observed to be in the $i$-th bin is $N_i^{\rm o}$
which, by virtue of the mixing between bins, is not equal to the true
number of galaxies in that redshift bin, $N_i^{\rm T}$. We define
$f_{ij}$ to be the number of galaxies from bin $i$ contaminating bin
$j$ as a fraction of the true number of galaxies in bin $i$. Therefore
$f_{ij} N_i^{\rm T}$ is the number of galaxies from bin $i$ present in
bin $j$. The fraction of galaxies in bin $i$ which do not contaminate
other bins is taken to be $f_{ii}$, which is convenient shorthand for

$$ f_{ii} = 1 - \sum_{k \neq i}^{n} f_{ik}, \tag{4} $$

where $n$ is the number of redshift bins.

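The observed number of galaxies in bin $i$ will be those galaxies
which contaminate the bin, plus those galaxies from bin $i$ which do
not contaminate other bins:

$$ N_i^{\rm o} = \sum_{k=1}^{n} N_k^{\rm T} f_{ki}, \tag{5} $$

where the $k=i$ term accounts for those galaxies that remain in bin
$i$ while all other terms account for galaxies that have contaminated
bin $i$. This results in a system of $n$ equations which can be
written as:

$$ \begin{pmatrix} N_1^{\rm o} \\ N_2^{\rm o} \\ \vdots \\ N_n^{\rm o} \end{pmatrix} = \begin{pmatrix} f_{11} & f_{21} & \cdots & f_{n1} \\ f_{12} & f_{22} & \cdots & f_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ f_{1n} & f_{2n} & \cdots & f_{nn} \end{pmatrix} \begin{pmatrix} N_1^{\rm T} \\ N_2^{\rm T} \\ \vdots \\ N_n^{\rm T} \end{pmatrix}. \tag{6} $$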

If the $n \times n$ matrix is invertible then we can solve the system
for the true number of galaxies in each bin. In the limit of zero
contamination the off-diagonal elements will tend to zero while the
diagonal elements tend to unity. In this limit the matrix is trivially
non-singular. If a matrix is strictly diagonally dominant then it
follows from the Gershgorin circle theorem that it is non-singular.
Therefore as long as

$$ f_{ii} > \sum_{k \neq i}^{n} f_{ik} \tag{7} $$

the matrix is strictly diagonally dominant (by columns) and therefore
is invertible. This condition is simply stating that a solution exists
if the majority of the galaxies from the $i$-th true redshift bin do
not contaminate other bins. The case of uniform contamination, where
each bin sends $N_i^{\rm T}/n$ galaxies to each other bin, results in
a singular matrix where all rows are identical.
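
The linear system of Eq. (6) can be sketched numerically as follows. This is a minimal illustration under assumptions: the three-bin contamination fractions and galaxy counts are invented, the array convention `mixing[i, k] = f_ki` is our own choice, and the helper names are hypothetical.

```python
import numpy as np

def is_column_dominant(mixing):
    """True if every diagonal entry exceeds the sum of the other entries in its column."""
    m = np.abs(np.asarray(mixing, dtype=float))
    diag = np.diag(m)
    return bool(np.all(diag > m.sum(axis=0) - diag))

def true_counts(mixing, n_obs):
    """Solve N_obs = mixing @ N_true for the true number of galaxies per bin."""
    return np.linalg.solve(np.asarray(mixing, dtype=float),
                           np.asarray(n_obs, dtype=float))

# Hypothetical 3-bin example; mixing[i, k] = f_ki, so each column sums to one.
mixing = np.array([[0.90, 0.05, 0.02],
                   [0.07, 0.90, 0.03],
                   [0.03, 0.05, 0.95]])
n_true = np.array([1000.0, 2000.0, 1500.0])
n_obs = mixing @ n_true                    # observed counts after mixing
recovered = true_counts(mixing, n_obs)     # inversion recovers the true counts

# Uniform contamination: all rows identical, so the matrix is singular.
uniform = np.full((3, 3), 1.0 / 3.0)
```

Because each column of the matrix sums to one, the total number of galaxies is conserved by the mixing.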

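The observed correlation functions can be derived by investigating
which pairs contribute when correlating bin $i$ and bin $j$. The
observed number of data pairs between bins can be related to
contributions from the true number of pairs:

$$ (D_i^{\rm o} D_j^{\rm o})_\theta = \sum_{k=1}^{n} \sum_{l=1}^{n} (D_k^{\rm T} D_l^{\rm T})_\theta\, f_{ki} f_{lj}. \tag{8} $$

Multiplying both sides by $N_R N_R / [(RR)_\theta N_i^{\rm o} N_j^{\rm o}]$
and using Eq. (2) to relate the pair counts to the correlation
function yields

$$ \omega_{ij}^{\rm o} = \sum_{k=1}^{n} \sum_{l=1}^{n} \omega_{kl}^{\rm T}\, \frac{N_k^{\rm T} N_l^{\rm T}}{N_i^{\rm o} N_j^{\rm o}}\, f_{ki} f_{lj}, \tag{9} $$

where the constant terms cancel by virtue of Eq. (5). This can also be
derived by considering $\langle N_i^{\rm o} N_j^{\rm o} \rangle$.

We assume that the true cross-correlation between any two redshift
bins is zero. Hence, it is useful to rewrite Eq. (9) as

$$ \omega_{ij}^{\rm o} = \sum_{k=1}^{n} \omega_{kk}^{\rm T}\, \frac{(N_k^{\rm T})^2}{N_i^{\rm o} N_j^{\rm o}}\, f_{ki} f_{kj}. \tag{10} $$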

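Note that this allows us to express the observed auto- and
cross-correlations as linear combinations of the true auto-correlation
functions. The following matrix equation follows from the above when
we consider only the equations for the observed auto-correlations
(i.e. when $i=j$):

$$ \begin{pmatrix} \omega_{11}^{\rm o} (N_1^{\rm o})^2 \\ \omega_{22}^{\rm o} (N_2^{\rm o})^2 \\ \vdots \\ \omega_{nn}^{\rm o} (N_n^{\rm o})^2 \end{pmatrix} = \begin{pmatrix} f_{11}^2 & f_{21}^2 & \cdots & f_{n1}^2 \\ f_{12}^2 & f_{22}^2 & \cdots & f_{n2}^2 \\ \vdots & \vdots & \ddots & \vdots \\ f_{1n}^2 & f_{2n}^2 & \cdots & f_{nn}^2 \end{pmatrix} \begin{pmatrix} \omega_{11}^{\rm T} (N_1^{\rm T})^2 \\ \omega_{22}^{\rm T} (N_2^{\rm T})^2 \\ \vdots \\ \omega_{nn}^{\rm T} (N_n^{\rm T})^2 \end{pmatrix}. \tag{11} $$

We are again left with an $n \times n$ matrix which, when inverted,
lets us express the true auto-correlations in terms of the observed
auto-correlations. Once they are known, we can use Eq. (10) to express
the observed cross-correlations as a linear combination of observed
auto-correlation functions:

$$ \omega_{ij}^{\rm o} = \sum_{k=1}^{n} \omega_{kk}^{\rm o}\, \frac{(N_k^{\rm o})^2}{N_i^{\rm o} N_j^{\rm o}}\, g_k(f), \tag{12} $$

where $i \neq j$ and $g_k(f)$ is a complicated function of the
contamination fractions. Note that the true number of objects from
Eq. (10) has cancelled with that from Eq. (11).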
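
Equations (10) and (11) amount to a linear forward model and its inverse, which the following sketch illustrates. It is not the pipeline used in this work: the power-law auto-correlations, counts and contamination fractions are invented, the convention `mixing[i, k] = f_ki` is an assumption, and the function names are hypothetical.

```python
import numpy as np

def observed_correlations(w_true, n_true, mixing):
    """Forward model of Eq. (10), assuming zero true cross-correlation.

    w_true has shape (n_bins, n_theta); mixing[i, k] = f_ki.
    Returns the (n_bins, n_bins, n_theta) observed correlations and the
    observed counts.
    """
    w_true = np.asarray(w_true, dtype=float)
    n_true = np.asarray(n_true, dtype=float)
    mixing = np.asarray(mixing, dtype=float)
    n_obs = mixing @ n_true
    weighted = w_true * n_true[:, None] ** 2
    # sum over k of f_ki * f_kj * w_kk^T * (N_k^T)^2
    num = np.einsum('ik,jk,kt->ijt', mixing, mixing, weighted)
    return num / (n_obs[:, None, None] * n_obs[None, :, None]), n_obs

def true_autocorrelations(w_obs_auto, n_obs, mixing):
    """Invert Eq. (11) to recover the true auto-correlations."""
    mixing = np.asarray(mixing, dtype=float)
    n_true = np.linalg.solve(mixing, n_obs)
    weighted = np.linalg.solve(mixing ** 2, w_obs_auto * n_obs[:, None] ** 2)
    return weighted / n_true[:, None] ** 2

# Hypothetical inputs: three bins with power laws of different slope.
theta = np.logspace(-2, 0, 5)
w_true = np.array([0.10 * theta ** -0.8,
                   0.20 * theta ** -0.6,
                   0.15 * theta ** -0.7])
n_true = np.array([1000.0, 2000.0, 1500.0])
mixing = np.array([[0.90, 0.05, 0.02],
                   [0.07, 0.90, 0.03],
                   [0.03, 0.05, 0.95]])
w_obs, n_obs = observed_correlations(w_true, n_true, mixing)
idx = np.arange(3)
recovered = true_autocorrelations(w_obs[idx, idx], n_obs, mixing)
```

With an identity mixing matrix the predicted cross-correlations vanish, while any contamination produces a positive cross-correlation built from the true auto-correlations.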

The $n \times n$ matrix in Eq. (11) above is very similar to that
found in Eq. (6). The diagonal elements tend to unity as the
contamination between bins tends to zero, resulting in a nearly
diagonal matrix. The matrix will be strictly diagonally dominant and
hence non-singular under the same condition found above, and for
uniform contamination the matrix is singular.

The total number of unknown parameters ($f_{ij}$, where $i \neq j$) is
$n(n-1)$. This is clear since, in general, each of the $n$ bins can
contaminate each of the other $(n-1)$ bins. Our goal is to constrain
these parameters with Eq. (12), which yields only $n(n-1)/2$ equations
since $\omega_{ij}^{\rm o} = \omega_{ji}^{\rm o}$. If we could only
measure the angular correlation function at one scale then it would be
impossible to constrain the parameters. However, if we have two
independent measurements of the cross-correlations at different
scales, we double the number of equations and are able to constrain
the problem. In practice the measurements of the angular correlation
function at different scales are not independent, and we should strive
to include as large a range of scales as possible.

Eq. (12) relates the amplitude of the cross-correlation to a weighted
sum of the auto-correlations. If the shape of all the
auto-correlations were the same (for example, if they were well
described by power laws with the same slope), then it would be
impossible to distinguish their contributions to the
cross-correlation, which would also have the same shape. In such cases
this method would yield completely degenerate parameter constraints.



3.2 Pairwise analysis 

Here we first consider the case of exactly two redshift bins. This
provides a simple case with which to test our method using the
Millennium Simulation in §4. We then expand this to multiple redshift
bins by considering each pair in turn. We show that this global
pairwise analysis will yield accurate estimates for the entire set of
contamination fractions if the contamination is sufficiently small.

We consider two redshift bins, labelled 1 and 2. From Eq. (5) we can
express the observed number of galaxies in each bin in terms of the
true number of galaxies and the contamination fractions:

$$ \begin{aligned} N_1^{\rm o} &= N_1^{\rm T} (1 - f_{12}) + N_2^{\rm T} f_{21}, \\ N_2^{\rm o} &= N_2^{\rm T} (1 - f_{21}) + N_1^{\rm T} f_{12}. \end{aligned} \tag{13} $$

Each observed redshift bin contains those galaxies which do not
contaminate the other bin (e.g. $N_1^{\rm T}(1 - f_{12})$) plus those
galaxies which are contaminating from the other bin (e.g.
$N_2^{\rm T} f_{21}$). Note that the total number of galaxies,
$N_1^{\rm o} + N_2^{\rm o} = N_1^{\rm T} + N_2^{\rm T}$, is conserved.

Figure 1. Upper bounds on the contamination fraction, $f$, as a function of
the number of redshift bins $n$. The upper bound represents the limit at which
the pairwise approach breaks down. A global contamination at these levels
will cause at most a 10 per cent error in the estimate of $\omega_{ii}^{\rm T}$ (Eq. 15).

Inverting Eq. (11) allows us to express the true auto-correlations in
terms of the observed auto-correlations. The resulting equalities can
be plugged into Eq. (10), yielding

$$ \omega_{12}^{\rm o} = \frac{\omega_{11}^{\rm o}\, \dfrac{N_1^{\rm o}}{N_2^{\rm o}}\, f_{12} f_{22} + \omega_{22}^{\rm o}\, \dfrac{N_2^{\rm o}}{N_1^{\rm o}}\, f_{21} f_{11}}{f_{11} f_{22} + f_{12} f_{21}}. \tag{14} $$


As long as the two bins considered comprise the entire sample 
of galaxies, this formalism is consistent. Considering a sub-sample 
allows for leakage to and from the region exterior to the two bins. 
This can induce a cross-correlation due to mutual contamination of 
the considered bins from the exterior region and breaks the implicit 
assumption of galaxy number conservation. 

The two-bin case has practical applications such as back- 
ground selection in cluster lensing, where one seeks a background 
population that does not share members with the selected fore- 
ground cluster. 
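
The two-bin relation of Eq. (14) can be checked numerically against the general expressions. In the sketch below the true auto-correlations, counts and contamination fractions are invented for illustration, and the function name is hypothetical; the observed quantities are built from the true ones assuming zero true cross-correlation, then compared with the closed form.

```python
import numpy as np

def two_bin_cross(w11_obs, w22_obs, n1_obs, n2_obs, f12, f21):
    """Eq. (14): observed cross-correlation predicted from the observed autos."""
    f11, f22 = 1.0 - f12, 1.0 - f21
    num = (w11_obs * (n1_obs / n2_obs) * f12 * f22
           + w22_obs * (n2_obs / n1_obs) * f21 * f11)
    return num / (f11 * f22 + f12 * f21)

# Hypothetical true quantities, used only to check the algebra.
theta = np.logspace(-2, 0, 6)
w11_true, w22_true = 0.05 * theta ** -0.8, 0.12 * theta ** -0.6
n1_true, n2_true = 1000.0, 800.0
f12, f21 = 0.08, 0.04
f11, f22 = 1.0 - f12, 1.0 - f21

# Observed counts (Eq. 13) and correlations (Eq. 10 restricted to two bins).
n1_obs = n1_true * f11 + n2_true * f21
n2_obs = n2_true * f22 + n1_true * f12
a, b = w11_true * n1_true ** 2, w22_true * n2_true ** 2
w11_obs = (f11 ** 2 * a + f21 ** 2 * b) / n1_obs ** 2
w22_obs = (f12 ** 2 * a + f22 ** 2 * b) / n2_obs ** 2
w12_direct = (f11 * f12 * a + f21 * f22 * b) / (n1_obs * n2_obs)
w12_eq14 = two_bin_cross(w11_obs, w22_obs, n1_obs, n2_obs, f12, f21)
```

The direct forward calculation and the closed form agree, and the prediction vanishes when both contamination fractions are zero.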

We now address the global pairwise analysis which, if the
contamination fractions are small, is a good approximation of the
full-matrix approach detailed at the beginning of this section. In
this case considering each pair of redshift bins in turn will yield
estimates for the entire set of contamination fractions. We start by
assuming that second order terms in the cross-contaminations are
small, such that the off-diagonal elements of the matrix in Eq. (11)
satisfy $f_{ij}^2 \ll f_{ii}^2$. This results in the following simple
relationship between the observed and true auto-correlations:

$$ \omega_{ii}^{\rm o} = \omega_{ii}^{\rm T} \left( \frac{N_i^{\rm T}}{N_i^{\rm o}} \right)^2 f_{ii}^2. \tag{15} $$

When combined with Eq. (10) this yields the following equation for the
observed cross-correlation:

$$ \omega_{ij}^{\rm o} = \omega_{ii}^{\rm o}\, \frac{N_i^{\rm o}}{N_j^{\rm o}}\, \frac{f_{ij}}{f_{ii}} + \omega_{jj}^{\rm o}\, \frac{N_j^{\rm o}}{N_i^{\rm o}}\, \frac{f_{ji}}{f_{jj}} + \sum_{k \neq i,j}^{n} \omega_{kk}^{\rm o}\, \frac{(N_k^{\rm o})^2}{N_i^{\rm o} N_j^{\rm o}}\, \frac{f_{ki} f_{kj}}{f_{kk}^2}. \tag{16} $$

All terms that are first order in $f$ have been taken out of the
summation operator. This equation is identical to the two-bin case of
Eq. (14) if we assume that second order terms are negligible.

This approximation will hold as long as the eigenvalues of the
$n \times n$ matrix of Eq. (11) are well approximated by the diagonal
elements. Assuming a uniform cross contamination such that all
$f_{ij}$ equal $f$, and therefore $f_{ii} = (1 - (n-1)f)$, we have the
following matrix:

$$ \begin{pmatrix} (1-(n-1)f)^2 & f^2 & \cdots & f^2 \\ f^2 & (1-(n-1)f)^2 & \cdots & f^2 \\ \vdots & \vdots & \ddots & \vdots \\ f^2 & f^2 & \cdots & (1-(n-1)f)^2 \end{pmatrix}. \tag{17} $$

The eigenvalues are $(1-(n-1)f)^2 - f^2$ with multiplicity $n-1$ and
$(1-(n-1)f)^2 + (n-1)f^2$. The latter eigenvalue, being the larger
deviation from the case of a purely diagonal matrix, gives us a
constraint on $f$. If we require that the deviation be at most 10 per
cent, we find

$$ \frac{(n-1) f^2}{(1-(n-1)f)^2} = 0.1, \qquad f = \begin{cases} \dfrac{1 - (0.1(n-1))^{-1/2}}{n - 1 - 10} & n \neq 11 \\[2ex] (2(n-1))^{-1} & n = 11. \end{cases} \tag{18} $$

The limit on the contamination becomes more stringent when more
redshift bins are included in the analysis. Figure 1 gives the upper
limit on $f$ as a function of the number of redshift bins. The
contamination must be less than or equal to these values so that the
maximum error made in determining $\omega_{ii}^{\rm T}$ via Eq. (15)
is 10 per cent.
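
The bound on the uniform contamination can be evaluated with a few lines of code. The sketch below solves $(n-1)f^2 = 0.1\,(1-(n-1)f)^2$ for the physically relevant root, using an algebraically equivalent closed form that avoids treating $n = 11$ as a special case; the function name and the `tolerance` parameter are hypothetical.

```python
import math

def max_uniform_contamination(n_bins, tolerance=0.1):
    """Largest uniform f for which the eigenvalue deviation stays within `tolerance`.

    Solves m f^2 = tolerance * (1 - m f)^2 with m = n_bins - 1, taking the
    root with 0 < f < 1/m.
    """
    m = n_bins - 1
    if m < 1:
        raise ValueError("at least two redshift bins are required")
    root_t = math.sqrt(tolerance)
    return root_t / (math.sqrt(m) + root_t * m)

# Upper bounds for 2 to 12 redshift bins.
bounds = {n: max_uniform_contamination(n) for n in range(2, 13)}
```

The bound decreases monotonically with the number of bins, in line with the statement that the limit becomes more stringent as more bins are included.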

With these limitations in mind we adopt the pairwise analysis
throughout the rest of this work. A standard maximum likelihood
procedure is used to estimate the contamination fractions by measuring
the angular correlation functions at multiple scales and fitting them
with Eq. (14). Measurements of the angular correlation functions at
different scales are not independent; thus the errors must be
described by a covariance matrix. We use a bootstrapping technique to
construct the covariance matrix, and include an additional source of
error which we refer to as the clustering covariance matrix (van
Waerbeke, 2010). We refer the reader to appendix A for the details of
the covariance matrix and the likelihood method.
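
A generic form of such a fit is sketched below: a Gaussian log-likelihood evaluated on a grid of the two contamination fractions. This is only an illustration under stated assumptions, not the procedure of appendix A: the observed auto-correlations are treated as noise-free, the covariance is taken to be diagonal, and all numerical values and names (`log_likelihood_grid`, `eq14`) are hypothetical.

```python
import numpy as np

def log_likelihood_grid(data, model, inv_cov, f_grid):
    """Gaussian log-likelihood on a square grid of (f12, f21).

    `model(f12, f21)` must return the predicted cross-correlation at the
    same angular scales as `data`; `inv_cov` is the inverse covariance.
    """
    log_like = np.empty((f_grid.size, f_grid.size))
    for a, f12 in enumerate(f_grid):
        for b, f21 in enumerate(f_grid):
            resid = data - model(f12, f21)
            log_like[a, b] = -0.5 * resid @ inv_cov @ resid
    return log_like

# Hypothetical observed auto-correlations with different shapes.
theta = np.logspace(-2, 0, 8)
w11, w22 = 0.05 * theta ** -0.8, 0.12 * theta ** -0.6
n1, n2 = 952.0, 848.0                      # observed counts (made up)

def eq14(f12, f21):
    """Two-bin prediction of Eq. (14) from the observed auto-correlations."""
    f11, f22 = 1.0 - f12, 1.0 - f21
    num = w11 * (n1 / n2) * f12 * f22 + w22 * (n2 / n1) * f21 * f11
    return num / (f11 * f22 + f12 * f21)

f_grid = np.linspace(0.0, 0.2, 21)
data = eq14(f_grid[8], f_grid[4])          # mock data with f12 = 0.08, f21 = 0.04
inv_cov = np.eye(theta.size) / 1e-3 ** 2   # assumed uncorrelated errors of 1e-3
grid = log_likelihood_grid(data, eq14, inv_cov, f_grid)
best = np.unravel_index(np.argmax(grid), grid.shape)
```

Because the two auto-correlations have different slopes, the maximum of the grid falls on the input contamination fractions; with identical shapes the likelihood would be flat along the degeneracy direction.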



4 APPLICATION TO A SIMULATED GALAXY SURVEY 

The Millennium Simulation tracks the hierarchical growth of dark
matter structure from a redshift of 127 to the present. The simulation
volume is a periodic box of 500 $h^{-1}$ Mpc on a side, containing
$2160^3$ particles each with a mass of $8.6 \times 10^{8}\,M_\odot$.
The simulation assumes a concordance model $\Lambda$CDM cosmology,
$\Omega_{\rm m} = \Omega_{\rm dm} + \Omega_{\rm b} = 0.25$,
$\Omega_{\rm b} = 0.045$, $\Omega_\Lambda = 0.75$, $h = 0.73$, $n = 1$
and $\sigma_8 = 0.9$, though the details of the cosmology do not
affect any of the results presented here. A complete description of
the simulation is presented in Springel et al. (2005) and references
therein.

We employ mock observational catalogues constructed from the
Millennium Simulation to test the method described in §3. The
catalogues consist of six square pencil beam fields of 1.4 degrees on
a side, containing a total of 28.7 million galaxies at redshifts less
than 4.0. A full description of the catalogues is given by Kitzbichler
& White (2007). To identify the mock catalogues, Kitzbichler & White
(2007) give each a label 'a' through 'f'. We maintain this notation
throughout the current work when distinguishing between the
catalogues.

It is important to note that these mock observational catalogues do
not include the effects of lensing, and in particular they do not
account for weak lensing magnification. Since magnification will
create an angular correlation between high and low redshift, it is
relevant to the current work and will be addressed in §6.

Figure 3. The 36 dotted (gray) lines are all of the possible cross-correlations
found by swapping the high and low redshift bins between fields. The
dashed (green) line shows y=0. The solid (red) line is the average of the
36 cross-correlations, and the error bar is the standard error. The average is
consistent with zero, showing that the clustering term does not bias the
cross-correlation.

We first present the results when applying the two-bin analysis 
to the mock observational catalogues, and discuss the effects of bin 
width and galaxy density. We then apply the global pairwise analy- 
sis demonstrating its ability to recover the true number of galaxies 
in a redshift bin, and the true average redshift of galaxies in each 
redshift bin. 



4.1 Null Test

Since there is no contamination between redshift bins in the simulated
data, the angular cross-correlation function between any two redshift
bins ought to be consistent with zero. To test this, two redshift bins
are chosen so that there is, approximately, an equal number of
galaxies in each; the bins are $0.3 < z_1 < 0.5$, which contains
69,139 galaxies per field on average, and $0.8 < z_2 < 0.85$, which
contains 50,575 galaxies per field on average.

The top left panel of Figure 2 shows the result of measuring the
angular correlation functions on each of the six fields. The
auto-correlation functions for a given redshift slice have different
amplitudes due to the presence of different structures in each of the
fields. The insert shows a zoom of the cross-correlation functions,
and the errors include a bootstrapping term as well as the clustering
term (see appendix A). The clustering term dominates the error at all
but the smallest scales, providing a relatively uniform source of
noise which behaves like a constant shift in the cross-correlation
function. As seen in the top left panel of Figure 2, the six fields
provided by the mock catalogues are slightly biased towards a positive
shift. However, the clustering term is symmetric about zero and with
enough measurements should have no bias. To test this we measured the
cross-correlation between all combinations of high- and low-redshift
bins, so that the high-redshift bin from field 'a' is cross-correlated
with the low-redshift bin of fields 'a', 'b', 'c', 'd', 'e' and 'f',
and likewise for the high-redshift bins of the other five fields. The
resulting 36 cross-correlations are displayed in Figure 3, where the
solid (red) line shows the average value, which is consistent with
zero at all scales.

The contamination fractions are estimated from the measured angular
correlation functions as described in appendix A. The resulting
likelihood is presented in the bottom left panel of Figure 2. Zero
contamination is consistent with the results, and contamination in
excess of ~2 per cent can be ruled out at the 99.9 per cent confidence
level.

We conclude that, as expected, there is no significant angu- 
lar cross-correlation between widely separated redshift bins in the 
Millennium Simulation. Furthermore our method of estimating the 
contamination performs well in this case, indicating that there is 
less than ~1 per cent contamination at the 68 per cent confidence 
level. 



4.2 Artificial contamination 

Contamination was added to each field by randomly shifting galaxies
between the redshift bins. The contamination fractions were taken to
be $f_{12} = 0.08$ and $f_{21} = 0.04$. The measured angular
correlations are shown in the top right panel of Figure 2, where a
cross-correlation signal is clearly seen. The recovered likelihood for
the contamination fractions is shown in the bottom right panel of
Figure 2. The input value lies in the 68 per cent confidence region,
and zero contamination ($f_{12} = f_{21} = 0$) is ruled out well
beyond the 99.9 per cent confidence level.

The parameter space is highly degenerate; contamination in 
either direction (low to high redshift, or vice versa) increases the 
cross-correlation amplitude. Degeneracy breaking comes from the 
ability to probe the shape of the auto-correlation functions. If only 
one scale is used, the parameters are completely degenerate. Like- 
wise if the auto-correlation of each redshift bin has the same shape 
(for example, if they are power laws with the same slope), then the 
parameters are completely degenerate. 

Each field has independent structure, causing variations in the
measured correlation functions. It is important that this is not
considered a source of sample variance. Any features in the
auto-correlation function will also be present in the
cross-correlation function if contamination is present. The inserts in
the bottom panels of Figure 2 show the parameter constraints obtained
by analysing each field independently. These contours are combined,
yielding a better measurement of the contamination.

We have repeated this procedure for several different contamination
fractions $(f_{12}, f_{21})$ including (0.0, 0.02), (0.15, 0.0),
(0.0, 0.15) and (0.0, 0.45). In all cases the input contamination was
recovered within the 68 per cent confidence region.



4.3 Effect of galaxy density and redshift bin width 

Each of the six mock catalogues of Kitzbichler & White (2007) contains
on average $4.79 \times 10^{6}$ galaxies out to redshift 4. The two
redshift bins used thus far, $0.3 < z < 0.5$ and $0.80 < z < 0.85$,
were chosen to be well separated in redshift and contain roughly equal
numbers of galaxies. On average there are $\sim 6 \times 10^{4}$
galaxies per redshift bin per field, corresponding to a density of 8.5
arcmin$^{-2}$.

In order to test the effects of object density, galaxies were randomly
removed from the mock catalogues, which were then contaminated with
$f_{12} = 0.08$ and $f_{21} = 0.04$. Densities of 4.3 and 1.7
arcmin$^{-2}$ were tested. We found that lower densities yielded only
marginally weaker constraints compared to those in the bottom right
panel of Figure 2.
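
The quoted surface densities follow directly from the field geometry. As a quick check, assuming each field is a square of 1.4 degrees on a side as stated for the mock catalogues (the function name is hypothetical):

```python
def surface_density(n_galaxies, side_deg=1.4):
    """Galaxies per square arcminute for a square field of the given side."""
    side_arcmin = side_deg * 60.0
    return n_galaxies / side_arcmin ** 2

# About 6 x 10^4 galaxies per bin per field.
density_wide = surface_density(6.0e4)
```

The same function applied to a bin holding roughly 24,000 galaxies per field gives about 3.5 per square arcminute.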

Reducing object density has a greater effect on constraints from
individual fields; much of the constraining power results from the
combination of the six fields. As long as there are sufficient
galaxies to measure the angular correlation function accurately, the
contamination can be constrained. There is a disproportionately small
change to the contours when the density is reduced since we are not
dominated by Poisson noise.

Figure 2. The left side panels are for the case of no contamination, the right side panels are for a contamination of $f_{12} = 0.08$ and $f_{21} = 0.04$. Top row:
The angular auto- and cross-correlation functions for each test contamination of the mock catalogues. Each of the six fields is plotted with a different line style
(colour). The higher redshift bin, $0.8 < z_2 < 0.85$, has a larger amplitude than the lower redshift bin, $0.3 < z_1 < 0.5$. A zoom of the cross-correlation is
given in the insert, and the legend therein identifies which line style applies to each of the six mock catalogues. The error bars on the auto-correlation functions
come from bootstrapping the catalogue. The cross-correlation functions also include the contribution from the clustering covariance (refer to appendix A for
details). The cross-correlation is consistent with zero for the case of zero contamination (left panel) and deviates significantly from zero for the contaminated
case (right panel). Bottom row: The likelihood region of the contamination fractions. The shaded regions denote the 68, 95 and 99.9 per cent confidence areas,
with increasing darkness indicating increasing significance. The contours result from summing the log-likelihoods for the six individual fields. The insert
shows the result for each of the individual fields. The input contamination in the right panel is marked with a dot.

If narrower redshift bins are taken, for example $0.3 < z < 0.4$ and
$0.8 < z < 0.82$, then the constraints on the contamination improve
considerably. Each bin contains 24,451 galaxies per field on average
with a corresponding density of 3.5 arcmin$^{-2}$. Figure 4 shows the
parameter constraints when a contamination of $f_{12} = 0.08$ and
$f_{21} = 0.04$ is used; the constraints for wider bins are
over-plotted for comparison. The narrow bins offer much tighter
constraints despite the density being lowered as a result of the
smaller bin size.

The angular correlation function for narrower redshift bins is more
distinct since a larger fraction of galaxies are physically clustered
with each other. Making the bins wider dilutes the sample with
galaxies that are not clustered, reducing the correlation. Arbitrarily
narrow bins will cluster strongly, providing a good measure of the
correlation function, but will suffer due to low number densities.
More work is needed to determine an optimal trade-off between these
parameters.

4.4 Global pairwise analysis 

We divide the data into the following redshift bins: [0.0,0.5],
(0.5,0.8], (0.8,1.1] and (1.1,1.5], which we label $z_1$, $z_2$, $z_3$
and $z_4$ respectively. The average number of galaxies in each bin
from low to high redshift is: 60804, 94907, 63707, 55290. In order to
test the global pairwise analysis, contamination fractions between all
redshift bins must be specified. In lieu of a completely random
contamination matrix we have taken one which is similar to the
contamination we measure for the CFHTLS-Deep fields in §5. Since the
measured contamination in the Deep fields appears to be on the extreme
end of the global pairwise approximation we have reduced some of the
contamination values, though the adopted contamination matrix remains
a strong test of the global pairwise approximation. The contamination
matrix used is,

Figure 4. Constraints on the contamination fractions for an input model
of $f_{12} = 0.08$ and $f_{21} = 0.04$, which is marked with a dot. The lined
contours show the results for the wider redshift bins; $0.3 < z < 0.5$ and
$0.80 < z < 0.85$. The filled contours give the constraints for narrower
redshift bins; $0.3 < z < 0.4$ and $0.8 < z < 0.82$. Narrow redshift bins
provide tighter constraints despite the decreased number density.

$$
\begin{pmatrix}
0.70 & 0.01 & 0.10 & 0.10 \\
0.10 & 0.925 & 0.15 & 0.04 \\
0.05 & 0.06 & 0.70 & 0.08 \\
0.15 & 0.005 & 0.05 & 0.78
\end{pmatrix}
\tag{19}
$$



where $f_{ij}$ refers to the entry in the $i$th column and the $j$th row.
Galaxies are chosen randomly to be moved between redshift bins 
such that Eq. ^ is satisfied. 

The result of measuring the angular auto-correlation function
for each redshift bin and the cross-correlation function for each
pair of redshift bins, as measured from field 'a', is presented in
the left panel of Figure 5. One hundred bootstraps of the contaminated
catalogues were constructed, from which the angular correlation
functions were measured. These hundred measurements were used to
estimate the covariance matrix, applying the correction described in
Hartlap et al. (2007). For the auto-correlation functions (subplots on
the diagonal) this is the only source of error; the cross-correlation
functions (off-diagonal subplots) have the clustering term as an
additional source of error.
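
The bootstrap covariance estimate described above can be sketched as follows. This is a minimal illustration rather than the code used for the paper: the function name `hartlap_inverse_covariance` and the array layout (one bootstrap realization per row) are assumptions, and the factor (n - p - 2)/(n - 1) applied to the inverse sample covariance is taken here to be the correction of Hartlap et al. (2007), with n the number of bootstraps and p the number of angular bins.

```python
import numpy as np


def hartlap_inverse_covariance(samples):
    """Return (covariance, debiased inverse covariance) from bootstrap samples.

    samples: array of shape (n_boot, n_bins); each row is one bootstrap
    measurement of a correlation function in n_bins angular bins.
    """
    samples = np.asarray(samples, dtype=float)
    n_boot, n_bins = samples.shape
    if n_boot <= n_bins + 2:
        raise ValueError("need n_boot > n_bins + 2 for the correction")
    # Unbiased sample covariance between angular bins.
    cov = np.cov(samples, rowvar=False)
    # The inverse of a noisy covariance estimate is biased; rescale it.
    factor = (n_boot - n_bins - 2) / (n_boot - 1)
    return cov, factor * np.linalg.inv(cov)
```

With 100 bootstraps and, say, six angular bins, the rescaling factor is 92/99, so the correction is modest but not negligible.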

The pairwise analysis is applied to each pair of redshift bins,
yielding constraints on the contamination between each pair. The
constraints on the contamination fractions for field 'a' are given by
the filled contours in the right panel of Figure 5. Each of the six
fields is considered independent, and their constraints on the
contamination fractions are combined by multiplying the likelihoods
together. These combined constraints are given as lined contours,
over-plotted in Figure 5.
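
Multiplying the likelihoods of independent fields is equivalent to summing their log-likelihoods on a common parameter grid. The sketch below illustrates this under the assumption that each field's constraint is stored as a log-likelihood grid over the two contamination fractions of a pair; the function name `combine_fields` and the grid layout are hypothetical.

```python
import numpy as np


def combine_fields(log_likelihoods):
    """Combine per-field log-likelihood grids for one pair of bins.

    log_likelihoods: array of shape (n_fields, n_a, n_b), one grid over the
    two contamination fractions per field.
    Returns the combined likelihood on the same grid, normalised to sum to 1.
    """
    total = np.sum(np.asarray(log_likelihoods, dtype=float), axis=0)
    total -= total.max()  # guard against underflow before exponentiating
    combined = np.exp(total)
    return combined / combined.sum()
```

Working in log space avoids numerical underflow when many fields, or many angular bins, are combined.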

The pairwise analysis does not impose global constraints since
each pair of bins is considered independently. Not all combinations
of contamination fractions suggested by Figure 5 will produce a
consistent picture. For example, it is possible to select points from
each plot in the rightmost column such that more than 100 per cent
of galaxies from the highest redshift bin are contaminating other
bins. To get a sense of which solutions are globally consistent, and
what a typical global solution looks like, we employ a Monte-Carlo
method. We first randomly sample 1000 points from each of the
pairwise likelihood regions such that the density of points reflects
the underlying probability distribution. Taking a random point from
each pairwise analysis will uniquely specify a realization of the
global contamination, i.e. the matrix of Eq. (6). Thus a realization
of the true redshift distribution can be determined. We then impose
two constraints: the true number of galaxies in a bin cannot be less
than zero, and no more than 100 per cent of galaxies can be scattered
from a single bin. This procedure is repeated until 200,000
admissible realizations are found.
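
The Monte-Carlo procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the zero-based bin indices and the dictionary layout of the pairwise samples are hypothetical, and the observed counts are assumed to be related to the true counts by the contamination matrix acting on the vector of true counts, with each column summing to one.

```python
import numpy as np


def build_matrix(pair_point, n_bins):
    """Assemble a contamination matrix from one point per pair of bins.

    pair_point maps (i, j) with i < j to (f_ij, f_ji). Following the text,
    f_ij sits in column i, row j, so each column sums to one.
    """
    F = np.zeros((n_bins, n_bins))
    for (i, j), (f_ij, f_ji) in pair_point.items():
        F[j, i] = f_ij  # fraction of bin i scattered into bin j
        F[i, j] = f_ji  # fraction of bin j scattered into bin i
    # Diagonal: fraction of each bin that is not scattered elsewhere.
    F[np.diag_indices(n_bins)] = 1.0 - F.sum(axis=0)
    return F


def admissible_realizations(pair_samples, n_obs, n_keep, rng, max_tries=10**6):
    """Draw globally consistent (matrix, true counts) realizations.

    pair_samples maps (i, j) to an array of shape (n_samples, 2) holding
    (f_ij, f_ji) points drawn from that pair's likelihood region.
    """
    n_obs = np.asarray(n_obs, dtype=float)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_keep:
            break
        point = {pair: s[rng.integers(len(s))] for pair, s in pair_samples.items()}
        F = build_matrix(point, len(n_obs))
        if np.any(np.diag(F) < 0.0):
            continue  # more than 100 per cent scattered out of a bin
        try:
            n_true = np.linalg.solve(F, n_obs)  # assumed relation: n_obs = F n_true
        except np.linalg.LinAlgError:
            continue
        if np.any(n_true < 0.0):
            continue  # negative true number of galaxies
        kept.append((F, n_true))
    return kept
```

Each accepted realization carries both a contamination matrix and the corresponding estimate of the true number of galaxies per bin, so histograms of either quantity can be built directly from the returned list.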

We have verified that the admissible realizations of the con- 
tamination fractions are representative of the full probability dis- 
tributions they are drawn from. It is not the case that the global 
constraints exclude particular regions of parameter space. This is 
checked by constructing a probability distribution for each contam- 
ination fraction from the 200,000 measurements and comparing it 
to the marginalised PDF obtained from the likelihood contours. In 
all cases the two agree with each other. For the combined case, the 
estimated contamination fractions (taken to be the average value) 
with 68 per cent confidence levels are as follows. 



$$
f_{ij} =
\begin{pmatrix}
0.60^{+0.04}_{-0.05} & 0.03^{+0.02}_{-0.02} & 0.11^{+0.03}_{-0.03} & 0.11^{+0.05}_{-0.06} \\
0.12^{+0.04}_{-0.03} & 0.89^{+0.04}_{-0.03} & 0.18^{+0.05}_{-0.04} & 0.09^{+0.05}_{-0.05} \\
0.10^{+0.02}_{-0.02} & 0.07^{+0.02}_{-0.02} & 0.66^{+0.06}_{-0.06} & 0.16^{+0.07}_{-0.03} \\
0.18^{+0.02}_{-0.01} & 0.01^{+0.00}_{-0.01} & 0.05^{+0.02}_{-0.05} & 0.65^{+0.09}_{-0.09}
\end{pmatrix}
\tag{20}
$$

This result agrees well with the input contamination matrix of
Eq. (19): 11 of the 16 contamination fractions (69 per cent) agree
within the 68 per cent error estimate. As expected the pairwise
analysis slightly overestimates the level of contamination, which is
probably more pronounced in this case since the contamination matrix
is so aggressive.

Using Eq. ^ allows us to estimate the true number of galax- 
ies in each redshift bin for each of the 200,000 realizations of the 
contamination matrix. This is done for each of the six fields as well
as the combined case; the probability distributions for the estimated
true number of galaxies are presented in Figure 6. The cross-hatched
regions denote the 68 per cent confidence level. Since this is
simulated data with artificial contamination, we know the actual
uncontaminated number of galaxies in a given bin; this is over-plotted
as a dashed vertical line. For clarity we refer to this as the true
number of galaxies, and to our attempt to recover it as the estimated
true number of galaxies. The solid vertical line shows the observed
number of galaxies, i.e. the number of galaxies after contamination.
For the combined case the number of observed galaxies is taken to be
the average number of galaxies from the six fields. The global
pairwise analysis does a good job of recovering the true number of
galaxies in each bin, although in many cases the observed number is
also an acceptable fit, owing to the small contamination fractions and
the fact that the redshift distribution was nearly flat to begin with.
Focusing on the combined result for $z_2$ we see that the true number
of galaxies is a significantly better fit to the estimated true number
than the observed number.

Figure 5. Left panel: The angular correlation functions as measured for
field 'a'. The subplots along the diagonal contain the auto-correlation
for each redshift bin. The off-diagonal subplots contain the
cross-correlation function between two bins. The redshift bin labels at
the top of each column and to the left of each row denote which bins
are involved in the given correlation function. The vertical scale on
the top-left subplot applies to all subplots on the diagonal; the scale
on the top-right subplot applies to all off-diagonal subplots. Right
panel: The filled contours depict the parameter constraints estimated
for field 'a'. The shading from light to dark represents the 68, 95 and
99.9 per cent confidence levels. The lined contours are the result of
combining the parameter constraints from each individual field; for
clarity we only show the 68 and 99.9 per cent confidence levels. Each
subplot contains the constraints for a pair of contamination fractions.
The x-axis is taken to be the contamination fraction from the high
redshift bin to the low redshift bin and the y-axis is the reverse. The
two bins contributing to each subplot are indicated by redshift bin
labels at the top of each column and to the left of each row.

It is also possible to estimate the true average redshift of a
redshift bin, where the true average redshift is defined as the
average of the true redshifts of galaxies in a redshift bin that has
contamination. With real-world data this is the same as asking what
the average spectroscopic redshift is in a given photometric redshift
bin. This is straightforward to calculate for a simulated survey since
we know, for each redshift bin, the fraction of galaxies that came from
each other redshift bin. The true average redshift is given by:

$$
\bar{z}_i^{T} = \frac{1}{N_i^{o}} \sum_{k=1}^{n} \bar{z}_k^{\rm uncontam} f_{ki} N_k^{T} , \tag{21}
$$

where $\bar{z}_k^{\rm uncontam}$ is the uncontaminated average redshift
of a galaxy in bin $k$, and $f_{ki} N_k^{T}$ is the number of galaxies
in bin $i$ from bin $k$. In general the uncontaminated average redshift
of a bin is not known since it requires knowledge of the true redshift
of each galaxy. If we assume that the contamination does not
significantly alter the shape of the redshift distribution at the
sub-redshift bin level, then the average redshift of a bin will not be
changed by contamination. Therefore we estimate the average redshift of
an uncontaminated redshift bin ($\bar{z}_k^{\rm uncontam}$) as the
average redshift of the contaminated redshift bin.
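
A short numerical sketch of the true average redshift calculation may help. It assumes the contamination matrix is stored with the source bin along columns and the destination bin along rows, and that the normalisation is the observed number of galaxies in the destination bin; the function name `true_average_redshift` is hypothetical.

```python
import numpy as np


def true_average_redshift(F, n_true, z_uncontam):
    """Average true redshift of the galaxies observed in each bin.

    F[i, k] is the fraction of galaxies from bin k that end up in bin i,
    n_true[k] is the true number of galaxies in bin k, and z_uncontam[k]
    is the uncontaminated average redshift of bin k.
    """
    F = np.asarray(F, dtype=float)
    n_true = np.asarray(n_true, dtype=float)
    z_uncontam = np.asarray(z_uncontam, dtype=float)
    counts = F * n_true[np.newaxis, :]  # counts[i, k]: galaxies in bin i from bin k
    n_obs = counts.sum(axis=1)          # observed number of galaxies in bin i
    return counts @ z_uncontam / n_obs
```

For two bins with average redshifts 0.25 and 0.75, 100 galaxies each, and 10 and 20 per cent of galaxies scattered out of the low and high bin respectively, the low bin contains 90 native and 20 contaminating galaxies, so its true average redshift rises to 37.5/110, or about 0.34.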

For a given field, each of the 200,000 realizations of the
contamination matrix yields an estimate of the true average redshift
for each redshift bin. Thus a probability distribution function is
constructed for each bin and each field, including the combined case;
this result is presented in Figure 7. The solid vertical lines show
the average redshift with no contamination; this is what one would
measure to be the average redshift if the effects of contamination
were ignored. The dashed vertical lines show the true average redshift
for each bin, which we can measure directly here because we are
working with simulated data with known contamination. Our method does
a very good job of recovering the true average redshift. The lowest
and highest redshift bins suffer the most. This is expected since they
can only be contaminated by galaxies that are either higher or lower
in redshift respectively, whereas the middle bins are contaminated by
galaxies which are both higher and lower than the average, allowing
for a cancellation of the effect.

We have demonstrated that the global pairwise analysis can 
be used to reliably recover small contaminations between redshift 
bins. Using the Monte Carlo approach detailed in this section we 
have shown that we are able to estimate the true redshift distribu- 
tion, and the true average redshift within a bin. 



5 APPLICATION TO A REAL GALAXY SURVEY 

The Canada-France-Hawaii Telescope Legacy Survey (CFHTLS)
is a joint Canadian-French program designed to take advantage
of MegaPrime, the CFHT wide-field imager. This 36-CCD mosaic
camera has a 1 x 1 degree field of view and a pixel scale of 0.187
arcseconds per pixel. The deep component consists of four
one-square-degree fields imaged with five broad-band filters: u*, g',
r', i', z'. The fields are designated D1, D2, D3 and D4, and are
centred on RA; DEC coordinates of 02 26 00; -04 30 00, 10 00 28;
+02 12 21, 14 19 28; +52 40 41 and 22 15 32; -17 44 06 respectively.
Figure 6. This demonstrates the ability of the global pairwise analysis to
estimate the true (uncontaminated) redshift distribution. The x-axis is the 
number of galaxies in units of $1 \times 10^{4}$. The y-axis is the probability, which
has been scaled differently in each subplot for clarity. The histograms are 
the probability distribution of the true number of galaxies, and the cross- 
hashing denotes the 68 per cent confidence region. Each row of subplots is 
the result from one of the Millennium Simulation fields. The bottom row is 
the result when the constraints on the contamination fractions for each field 
are combined (see Figure 5). Please note that the bottom row is not a direct
combination of the results from the other rows. Each column represents a
redshift bin, as labelled. The solid vertical line in each subplot indicates the
observed number of galaxies. The dashed vertical line is the true number of 
galaxies. 
Figure 7. The global pairwise analysis is used to estimate the true average
redshift of each photometric redshift bin. The x-axis is the average red- 
shift. The y-axis is the probability, which has been scaled differently in 
each subplot for clarity. The histograms are the probability distribution of 
the average redshift, and the cross-hashing denotes the 68 per cent confi- 
dence region. Each row of subplots is the result from one of the Millennium 
Simulation fields. The bottom row is the result when the constraints on the 
contamination fractions for each field are combined (see Figure 5). Please
note that the bottom row is not a direct combination of the results from the 
other rows. Each column represents a redshift bin, as labelled. The solid 
vertical line in each subplot indicates the average redshift as measured from 
the uncontaminated catalogue. The dashed vertical line is the true average 
redshift. 



We use the deep photometric redshift catalogues from
Ilbert et al. (2006), who have estimated redshifts for the T0003
CFHTLS-Deep release. The redshift catalogue is publicly available
at terapix.iap.fr. The full photometric catalogue contains 522286
objects, covering an effective area of 3.2 deg$^2$. A set of 3241
spectroscopic redshifts with $0 \le z \le 5$ from the VIRMOS VLT Deep
Survey (VVDS) were used as a calibration and training set for the
photometric redshifts. The resulting photometric redshifts have an
accuracy of $\sigma_{(z_{\rm phot}-z_{\rm spec})/(1+z_{\rm spec})} = 0.043$
for $i'_{AB} = 22.5 - 24$, with a fraction of catastrophic errors of
5.4 per cent.

In this work we consider galaxies with 21.0 < i' < 25.0 and
divide the data into the following redshift bins: [0.0,0.2], (0.2,1.5],
(1.5,2.5] and (2.5,6.0], which we label $z_1$, $z_2$, $z_3$ and $z_4$
respectively. The average number of galaxies in each bin from low to
high redshift is: 5772, 96019, 15546, 5315. These redshift bins are
chosen to isolate the low confidence redshifts and to probe the bump
in the redshift distribution near redshift 3 (see Figure 8). Adopting
the definition of catastrophic error used by Ilbert et al. (2006),
$\Delta z > 0.15(1 + z)$, we measure the fraction of galaxies with
catastrophic errors in each bin. From low to high redshift we find:
23.2, 12.4, 33.2 and 23.1 per cent. It is difficult to interpret these
values in the context of the contamination fractions. Photometric
redshifts with large errors do not necessarily contaminate other
redshift bins. Furthermore $\Delta z$ does not take into account
degeneracies in the spectral template fitting that allow for
alternative solutions to the galaxy's redshift. The catastrophic errors
give some indication of the leakage between adjacent redshift bins but
cannot account for those galaxies which are completely misclassified.
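
As a small illustration of the catastrophic error criterion, the sketch below computes the fraction of galaxies in each redshift bin that satisfy it. How the per-galaxy offset is obtained is not specified here, so `delta_z` is treated as a user-supplied array; the function name and the handling of bin edges are assumptions.

```python
import numpy as np


def catastrophic_fraction_per_bin(z, delta_z, edges, threshold=0.15):
    """Fraction of galaxies per redshift bin with |delta_z| > threshold * (1 + z).

    edges: increasing bin edges; the first bin is closed, [e0, e1], and the
    rest are half-open, (e_k, e_k+1], matching the binning in the text.
    """
    z = np.asarray(z, dtype=float)
    delta_z = np.asarray(delta_z, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1
    in_range = (z >= edges[0]) & (z <= edges[-1])
    # digitize(right=True) returns k for edges[k-1] < z <= edges[k]; clip so
    # that z == edges[0] lands in the first bin.
    index = np.clip(np.digitize(z, edges, right=True) - 1, 0, n_bins - 1)
    bad = np.abs(delta_z) > threshold * (1.0 + z)
    fractions = np.full(n_bins, np.nan)  # NaN marks an empty bin
    for k in range(n_bins):
        members = in_range & (index == k)
        if members.any():
            fractions[k] = bad[members].mean()
    return fractions
```

A call with `edges=[0.0, 0.2, 1.5, 2.5, 6.0]` reproduces the binning adopted in this section.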



5.1 Applying the global pairwise analysis 

We apply the two-bin analysis between each pair of redshift bins,
and for each of the four Deep fields. The measured angular correlation
functions for D1 are presented in Figure 9. The covariance matrices
are estimated by bootstrapping the catalogues 100 times, and applying
the correction described by Hartlap et al. (2007). For the
cross-correlation covariance matrix we also calculate the clustering
covariance described in Appendix A. Parameter constraints are
estimated for each cross-correlation; those for D1 are presented in
Figure 9. The other three fields have very similar angular correlation
functions and parameter constraints. The parameter constraints for all
of the fields can be combined by treating them as statistically
independent and multiplying their likelihoods together, yielding
tighter constraints (lined contours in Figure 9).

Figure 8. Finely binned redshift distribution for each of the four deep
fields. The vertical lines denote the binning adopted in our global
pairwise analysis. Note the bump near redshift 3.

We constructed 200,000 realizations of a globally consistent
contamination matrix, as detailed in §4.4, and verified that the
admissible realizations are representative of the full probability
distributions they are drawn from. The contamination matrix estimated
from the combined constraints is,



+0.17 

0.17 
+0.06 
0.18 
+0.05 
-0.16 

Vo.ii«:?? 



/0.50:! 

0.18^ 

o.iej 



U.UUD_o QQ3 



+0.014 
0.006 
+0.001 
0.006 

U.UU1_Q 001 



0.979 
0.006 



0.31 
0.53 

0.07; 



+0.08 
-0.04 
+0.05 
0.10 
+0.04 
0.02 



0.04+001 



0.09 
0.74 



0.04 
+0.03 
0.09 
+0.10 
0.10 



(22) 

where the entries represent the average value calculated from their
probability distribution. The maximum contamination fraction for
four redshift bins presented in Figure 1 is ~0.12, which is smaller
than some of the contamination fractions found here (or seen in
Figure 9). It is possible that this is an indication that the pairwise
analysis does not hold for these data; however, several assumptions
were made in deriving the maximum contamination fraction which
do not hold here. We assumed that all contamination fractions have
the same value; this is not the case here, and although some tend to
be larger than 0.12, some are very close to zero, and several have
large errors encompassing zero. We also assumed that there were
equal numbers of galaxies in each bin. The ratio of the number
of galaxies between the pair of bins enters into the expression for
the observed cross-correlation. Since the first and last redshift bins
have far fewer galaxies, a large contamination fraction from one
of these bins represents only a small number of galaxies. We have
also demonstrated our ability to recover the contamination fractions
from a similarly aggressive contamination matrix in §4.4. For these
reasons we believe that the global pairwise analysis remains a good
approximation here.

The probability distribution of the true number of galaxies for
each redshift bin and each field is presented in Figure 10; the
cross-hatched regions indicate 68 per cent confidence. The observed
number of galaxies in each bin is denoted by a vertical line. The
bottom row contains the result when the constraints on the
contamination fractions for each of the four fields are combined. The
smallest fractional change is for $z_2$, which is the high confidence
photometric redshift bin. Taking the peak of the probability
distribution indicates about a factor of two fewer galaxies in the
highest redshift bin than are observed, suggesting that the bump seen
in the photometric redshift distribution is an artefact of
contamination. However, with only four square degrees of data, we are
unable to rule out the existence of this feature.

The set of globally consistent realizations of the contamination
can also be used to estimate the average redshift for each
photometric redshift bin. We use Eq. (21) and estimate the
uncontaminated average redshift of each bin
($\bar{z}_k^{\rm uncontam}$) as the average of the photometric
redshifts, which is a good approximation as long as the shape of the
observed redshift distribution within each bin is similar to that of
the true redshift distribution.

The results are presented in Figure 11, which shows the probability
distribution of the average redshift for each redshift bin and
each field. Vertical lines show the average redshift for each bin
measured from the photometric catalogue. It is clear that the lowest
and highest redshift bins ($z_1$ and $z_4$) contain galaxies whose
true average redshifts deviate significantly from the average redshift
expected for those redshift bins. This suggests that many galaxies in
bin $z_1$ are in fact from much higher redshifts. Similarly galaxies in
bin $z_4$ have a lower than expected average redshift.

Figure 10. This demonstrates the ability of the global pairwise analysis
to reconstruct the true (uncontaminated) redshift distribution. The
x-axis is the number of galaxies in units of $1 \times 10^{4}$. The
y-axis is the probability, which has been scaled differently in each
subplot for clarity. The histograms are the probability distribution of
the true number of galaxies, and the cross-hatching denotes the 68 per
cent confidence region. Each row of subplots is the result from one of
the CFHTLS-Deep fields. The bottom row is the result when the
constraints on the contamination fractions for each field are combined.
Each column represents a redshift bin, as labelled. The vertical line in
each subplot indicates the observed number of galaxies.

Figure 11. The global pairwise analysis is used to estimate the true
average redshift of each photometric redshift bin. The x-axis is the
average redshift. The y-axis is the probability, which has been scaled
differently in each subplot for clarity. The histograms are the
probability distribution of the average redshift, and the
cross-hatching denotes the 68 per cent confidence region. Each row of
subplots is the result from one of the CFHTLS-Deep fields. The bottom
row is the result when the constraints on the contamination fractions
for each field are combined. Each column represents a redshift bin, as
labelled. The vertical line in each subplot indicates the average
redshift as measured from the photometric redshift catalogue.



6 CONCLUSION AND DISCUSSION 

We have presented an analytic framework for estimating contami- 
nation between photometric redshift bins, without the need for any 
spectroscopic data beyond those used to train the photometric red- 
shift code. To measure the contamination between redshift bins we 
exploit the fact that mixing between bins will result in a non-zero 
angular cross-correlation between those bins. We have shown how 
the contamination will affect the observed angular correlation func- 
tions for the general case of contamination between an arbitrary 
number of bins. For the cases of two and three bins we explicitly
work out the equations.

The two-bin case is given special attention since it is the
simplest case. We note that if the contamination between bins is 
small enough, then each pair of bins can be considered indepen- 
dently, yielding an accurate measure of contamination between all 
bins. We refer to this as a global pairwise analysis. 

We test our formalism with mock galaxy catalogues created 
from the Millennium Simulation. We verify that there is no evi- 
dence of contamination, finding an upper limit of ~2 per cent at 
the 99.9 per cent confidence level. The catalogues are then contaminated
by moving galaxies between redshift bins. We demonstrate
that the two-bin analysis is able to recover input contamination be- 
tween redshift bins. The effects of galaxy density and bin-width are 
investigated. We find that our ability to constrain the contamination 
fractions is not very sensitive to object density, whereas narrower 
bins offer better constraints. 

We split the mock catalogues into four redshift bins and ap- 
ply artificial contamination between all pairs. The global pairwise 
analysis is used to constrain the contamination fractions between 
all pairs of redshift bins. A Monte-Carlo method is then used to es- 
timate the true (uncontaminated) redshift distribution, and the true 
average redshift of galaxies in each bin. This is valuable informa- 
tion for the cosmological interpretation of galactic surveys, and in 
particular weak lensing by large scale structure. We demonstrate 
the ability of the method to accurately recover the input contamina- 
tion as well as reconstruct the true redshift distribution and average 
redshift of each bin. 

The formalism is applied to a real galaxy survey: the four
square degree deep component of the Canada-France-Hawaii Telescope
Legacy Survey, for which there are photometric redshift catalogues
(Ilbert et al. 2006). We divide the data into four redshift
bins and apply the global pairwise analysis. This yields constraints 
on the contamination fractions, the true redshift distribution, and 
the true average redshift of galaxies in each bin. We demonstrate 
here the feasibility of the method with only four square degrees of 
sky coverage; future application to large galaxy surveys will signif- 
icantly improve constraining power. 

This work has focused on the application of the two-bin and
global pairwise methods. For a small number of redshift bins, with
sufficiently small contamination, the global pairwise analysis offers
a quick and easy means of assessing the contamination between
redshift bins. The benefit is largely computational: it is very fast to
constrain a model with only two free parameters. A more sophisticated
method, such as a Markov chain Monte Carlo (MCMC), will be needed to
implement the full multi-bin approach. With only three redshift bins
there is a total of six free parameters, which already renders the
simple maximum-likelihood approach impractical. Although the full
multi-bin approach will yield the most accurate results, for some
applications the simpler pairwise analysis should suffice.

Future work will need to take into account the effect of
weak lensing magnification, which causes an angular
cross-correlation between background galaxies and foreground lenses.
Since foreground lenses boost the observed brightness of background
galaxies there are more close pairs detected between these redshift
slices than one would expect from random placement. This effect is
well understood and can be easily modelled and accounted for
(Scranton et al. 2005); however, since it depends on cosmology and the
redshifts of the lens and background galaxies it cannot be removed in
a model-independent way. Dust extinction of the background sources is
also important, reducing the brightness of lensed galaxies
(Menard et al. 2010). This de-magnification is wavelength dependent,
and in visible passbands is comparable in magnitude to lensing
magnification; therefore, it will need to be accounted for along with
magnification.

The expected amplitude of the angular cross-correlation due to
magnification is small. Using the CFHTLS-Deep fields, and a magnitude
cut similar to that used in this work, van Waerbeke (2010) finds the
amplitude of the angular cross-correlation between the redshift bins
z = [0.1,0.6] and z = [1.1,1.4] to be about 0.01 on scales smaller
than 1 arcmin. While this is clearly an important contribution to the
angular cross-correlations measured here, it can only account for
about 10 per cent of the observed signal on these scales.

Photometric redshifts are more secure for red galaxy types. 
Furthermore red and blue galaxies cluster differently resulting in 
distinct angular correlation functions. Therefore, it is likely the case 
that galaxies doing the contamination are predominantly blue and 
exhibit a systematically different angular correlation than the red 
galaxies which do not contaminate. More work is needed to under- 
stand the severity of this bias. However, since any cross-correlation 
signal (above that expected from magnification) indicates contami- 
nation it is always possible to use this technique as a null test. 

We have presented a method of measuring contamination be- 
tween photometric redshift bins using the angular correlation func- 
tion, and without any need for spectroscopically determined red- 
shifts. The method is able to constrain the true redshift distribution 
and the true average redshift in a photometric bin, both of which are 
of keen interest to cosmological use of these data. The accuracy of 
this method will need to be improved to address the needs of high 
precision cosmology. The inclusion of the galaxy-shear correlation
function to break parameter degeneracies has been investigated by
Zhang et al. (2010), showing that the stringent requirements of future
surveys can be reached if this information is included. Without
the need for accurate weak lensing shear measurements, the method 
we present here is more accessible and provides valuable informa- 
tion. 



ACKNOWLEDGMENTS 

JB is supported by the Natural Sciences and Engineering Research 
Council (NSERC), and the Canadian Institute for Advanced Re- 
search (CIAR). LVW is supported by NSERC, CIAR and the Cana- 
dian Foundation for Innovation (CFI). The Millennium Simula- 
tion databases used in this paper and the web application provid- 
ing online access to them were constructed as part of the activi- 
ties of the German Astrophysical Virtual Observatory. This work 
is partly based on observations obtained with MegaPrime equipped 
with MegaCam, a joint project of CFHT and CEA/DAPNIA, at the 
Canada-France-Hawaii Telescope (CFHT) which is operated by the 
National Research Council (NRC) of Canada, the Institut National 
des Sciences de l'Univers of the Centre National de la Recherche
Scientifique (CNRS) of France, and the University of Hawaii. This 
work is based in part on data products produced at TERAPIX and 
the Canadian Astronomy Data Centre as part of the Canada-France- 
Hawaii Telescope Legacy Survey, a collaborative project of NRC 
and CNRS. This paper makes use of photometric redshifts pro- 
duced jointly by Terapix and VVDS teams. 



References 

Erben T., Hildebrandt H., Lerchster M., Hudelot P., Benjamin J.,
    van Waerbeke L., Schrabback T., Brimioulle F., Cordes O.,
    Dietrich J. P., Holhjem K., Schirmer M., Schneider P., 2009,
    A&A, 493, 1197
Hartlap J., Simon P., Schneider P., 2007, A&A, 464, 399
Huterer D., Kim A., Krauss L. M., Broderick T., 2004, ApJ, 615,
    595
Ilbert O., Arnouts S., McCracken H. J., Bolzonella M., Bertin E.,
    Le Fevre O., Mellier Y., Zamorani G., Pello R., Iovino A.,
    Tresse L., Le Brun V., Bottini D., Garilli B., Maccagni D.,
    Picat J. P., Scaramella R., Scodeggio M., Vettolani G.,
    Zanichelli A., Adami C., Bardelli S., Cappi A., Charlot S.,
    Ciliegi P., Contini T., Cucciati O., Foucaud S., Franzetti P.,
    Gavignaud I., Guzzo L., Marano B., Marinoni C., Mazure A.,
    Meneux B., Merighi R., Paltani S., Pollo A., Pozzetti L.,
    Radovich M., Zucca E., Bondi M., Bongiorno A., Busarello G.,
    de La Torre S., Gregorini L., Lamareille F., Mathez G.,
    Merluzzi P., Ripepi V., Rizzo D., Vergani D., 2006, A&A, 457, 841
Kerscher M., Szapudi I., Szalay A. S., 2000, ApJ, 535, L13
Kitzbichler M. G., White S. D. M., 2007, MNRAS, 376, 2
Landy S. D., Szalay A. S., 1993, ApJ, 412, 64
Menard B., Scranton R., Fukugita M., Richards G., 2010, MNRAS,
    405, 1025
Newman J. A., 2008, ApJ, 684, 88
Peebles P. J. E., 1980, The large-scale structure of the universe.
    Princeton University Press, Princeton, N.J.
Schneider M., Knox L., Zhan H., Connolly A., 2006, ApJ, 651,
    14
Scranton R., Menard B., Richards G. T., Nichol R. C., Myers
    A. D., Jain B., Gray A., Bartelmann M., Brunner R. J.,
    Connolly A. J., Gunn J. E., Sheth R. K., Bahcall N. A.,
    Brinkman J., Loveday J., Schneider D. P., Thakar A., York D. G.,
    2005, ApJ, 633, 589
Springel V., White S. D. M., Jenkins A., Frenk C. S., Yoshida N.,
    Gao L., Navarro J., Thacker R., Croton D., Helly J., Peacock
    J. A., Cole S., Thomas P., Couchman H., Evrard A., Colberg
    J., Pearce F., 2005, Nature, 435, 629
van Waerbeke L., 2010, MNRAS, 401, 2093
Zhang P., Pen U., Bernstein G., 2010, MNRAS, 405, 359



APPENDIX A: COVARIANCE AND LIKELIHOOD 

Here we present the details of the maximum-likelihood method,
and the covariance matrix used. Throughout the paper we fit the
observed angular cross-correlation between two redshift bins with the
model described by Eq. (14). Since there are observational errors
associated with the angular auto- and cross-correlation functions,
we have grouped these quantities on the left hand side of Eq. (14),
yielding:



0, 
(A1)

where the angular correlation functions are written as a function 
of scale 0. For simplicity let F represent the left hand side of the 
equation. We therefore seek to calculate the likelihood. 



$$ \mathcal{L} = \frac{1}{\sqrt{(2\pi)^{s}\,|C|}} \exp\left[ -\frac{1}{2}\,(F - m)\, C^{-1}\, (F - m)^{T} \right], \tag{A2} $$



where $s$ is the number of angular scale bins, $m$ is the model, which 
is zero for all scales, and $C$ is the $s \times s$ covariance matrix. 
The covariance matrix is 

$$ C_{kl} = \langle F_k F_l \rangle, \tag{A3} $$

where $k$ and $l$ denote the scales at which the angular correlation 
functions are measured. Expanding the above yields 

[Eq. (A4): $C_{kl}$ written as a sum of three terms, the first being the 
term in $\langle \omega^{o}_{12}(\theta_k)\,\omega^{o}_{12}(\theta_l) \rangle$; 
the remaining terms carry factors of $N^{o}$ and are not recoverable here.]   (A4) 

Ideally the covariance matrix can be estimated directly from the 
data, but this requires many fields. It is not possible to do this for 
either the Millennium Simulation or the CFHTLS-Deep data sets 
which we consider in this work. An alternative is to use a 
bootstrapping method, wherein the data catalogue is resampled multiple 
times, and each resampled catalogue is used to measure the angular 
correlation functions. These angular correlation functions can then 
be used to calculate the covariance. 

This procedure suffices for all contributions to the covariance 
matrix of Eq. (A4) except the first term, 
$\langle \omega^{o}_{12}(\theta_k)\,\omega^{o}_{12}(\theta_l) \rangle$. As shown in 
van Waerbeke (2010), the covariance of the cross-correlation 
function has a contribution due to the intrinsic clustering of the 
background and foreground populations. This so-called clustering term 
can be calculated analytically given the auto-correlation functions 
and the survey geometry. We add the covariance due to the clustering 
term to our bootstrap covariance matrix. The left panel in 
Figure A1 shows the covariance from the clustering term and the right 
panel shows the total covariance of the cross-correlation function 
between two redshift bins of the Millennium Simulation (see §4.1). 
The clustering term constitutes the largest contribution to the 
covariance matrix of Eq. (A4). 

Figure A1. The covariance matrix of the cross-correlation, 
$\langle \omega^{o}_{12}(\theta_k)\,\omega^{o}_{12}(\theta_l) \rangle$, 
has two contributions. We take field 'a' of the Millennium Simulation and 
two redshift bins as described in §4.1. The LEFT panel shows the clustering 
term calculated as described in van Waerbeke (2010). The matrices are 6×6 
and increase in scale from bottom to top and left to right. The RIGHT panel 
is the total covariance, which includes the bootstrap covariance matrix in 
addition to the clustering covariance. The scale on the right is in units of 
10^*. Note that the clustering term results in a very flat covariance between 
all scales, whereas the bootstrap covariance adds relatively little and only 
on the smallest scales. 

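
The bootstrap covariance and the Gaussian likelihood of Eq. (A2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the paper: the function names are hypothetical, `statistic` is a stand-in for whatever routine measures the correlation functions on a catalogue, galaxies are resampled individually with replacement (the text does not specify the resampling unit), and the normalisation of the likelihood is taken to be the standard multivariate Gaussian one.

```python
import numpy as np

def bootstrap_samples(catalogue, statistic, n_resamples, rng):
    """Resample catalogue rows with replacement and re-measure a statistic.

    `statistic` is a hypothetical callable standing in for the
    correlation-function measurement; it must return s numbers.
    """
    catalogue = np.asarray(catalogue)
    n = len(catalogue)
    rows = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        rows.append(statistic(catalogue[idx]))
    return np.array(rows, dtype=float)    # shape (n_resamples, s)

def bootstrap_covariance(samples):
    """Sample covariance over resamples; columns are angular scale bins."""
    samples = np.asarray(samples, dtype=float)
    return np.atleast_2d(np.cov(samples, rowvar=False, ddof=1))

def log_likelihood(F, C, m=None):
    """Log of a Gaussian likelihood with data F, model m and covariance C.

    Assumes the standard normalisation 1 / sqrt((2 pi)^s |C|).
    The model defaults to zero on all scales, as in the text.
    """
    F = np.asarray(F, dtype=float)
    m = np.zeros_like(F) if m is None else np.asarray(m, dtype=float)
    C = np.asarray(C, dtype=float)
    r = F - m
    s = F.size
    sign, logdet = np.linalg.slogdet(C)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    chi2 = r @ np.linalg.solve(C, r)      # (F - m) C^{-1} (F - m)^T
    return -0.5 * (chi2 + logdet + s * np.log(2.0 * np.pi))
```

In this sketch the total covariance would be formed as `bootstrap_covariance(samples) + C_clustering`, where `C_clustering` is an externally computed array for the clustering term; its calculation is not attempted here.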
APPENDIX B: SOLVING THE THREE-BIN CASE ANALYTICALLY 

For three bins it is easy to derive equations for the three observed cross-correlation functions, analogous to what is presented in Eq. (14) for 
the two-bin case. Let the nxn matrix in Eq. Jilt be called F. The inverse can be calculated using the adjoint, 

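$$ F^{-1} = \frac{\mathrm{adj}(F)}{\det(F)}. \tag{B1} $$

Both the adjoint, adj($F$), and the determinant, det($F$), can be calculated easily: 

$$ \mathrm{adj}(F) = \begin{pmatrix}
f_{22}^{2} f_{33}^{2} - f_{32}^{2} f_{23}^{2} & f_{31}^{2} f_{23}^{2} - f_{21}^{2} f_{33}^{2} & f_{21}^{2} f_{32}^{2} - f_{31}^{2} f_{22}^{2} \\
f_{32}^{2} f_{13}^{2} - f_{12}^{2} f_{33}^{2} & f_{11}^{2} f_{33}^{2} - f_{31}^{2} f_{13}^{2} & f_{31}^{2} f_{12}^{2} - f_{11}^{2} f_{32}^{2} \\
f_{12}^{2} f_{23}^{2} - f_{22}^{2} f_{13}^{2} & f_{21}^{2} f_{13}^{2} - f_{11}^{2} f_{23}^{2} & f_{11}^{2} f_{22}^{2} - f_{21}^{2} f_{12}^{2}
\end{pmatrix} $$

$$ \det(F) = f_{11}^{2} f_{22}^{2} f_{33}^{2} + f_{21}^{2} f_{32}^{2} f_{13}^{2} + f_{31}^{2} f_{12}^{2} f_{23}^{2} - f_{11}^{2} f_{32}^{2} f_{23}^{2} - f_{21}^{2} f_{12}^{2} f_{33}^{2} - f_{31}^{2} f_{22}^{2} f_{13}^{2} \tag{B2} $$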
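
As a numerical cross-check of the adjoint-based inverse, the following Python sketch builds a 3×3 matrix, forms its adjugate from cofactors, writes the determinant as six triple products in the manner of Eq. (B2), and compares with the identity. It is an illustration only: the function names and the toy contamination fractions are invented, and the convention that element $(a,b)$ of $F$ equals $f_{ba}^{2}$ is an assumption that is not restated in this appendix.

```python
import numpy as np

def adjugate_3x3(F):
    """Adjugate (transpose of the cofactor matrix) of a 3x3 matrix."""
    F = np.asarray(F, dtype=float)
    adj = np.empty((3, 3))
    for a in range(3):
        for b in range(3):
            # adj[a, b] is the cofactor of element (b, a):
            # delete row b and column a, then take the signed 2x2 minor.
            minor = np.delete(np.delete(F, b, axis=0), a, axis=1)
            det2 = minor[0, 0] * minor[1, 1] - minor[0, 1] * minor[1, 0]
            adj[a, b] = (-1) ** (a + b) * det2
    return adj

def det_3x3(F):
    """Determinant written out as six triple products."""
    F = np.asarray(F, dtype=float)
    return (F[0, 0] * F[1, 1] * F[2, 2]
            + F[0, 1] * F[1, 2] * F[2, 0]
            + F[0, 2] * F[1, 0] * F[2, 1]
            - F[0, 0] * F[1, 2] * F[2, 1]
            - F[0, 1] * F[1, 0] * F[2, 2]
            - F[0, 2] * F[1, 1] * F[2, 0])

def inverse_via_adjugate(F):
    """Inverse as adj(F) / det(F); fails loudly for a singular matrix."""
    d = det_3x3(F)
    if d == 0.0:
        raise ValueError("matrix is singular")
    return adjugate_3x3(F) / d

# Toy contamination fractions (invented values; each row sums to one).
f_toy = np.array([[0.90, 0.07, 0.03],
                  [0.05, 0.85, 0.10],
                  [0.02, 0.08, 0.90]])
F_toy = (f_toy ** 2).T  # assumed convention: F[a, b] = f[b, a]**2
```

Calling `inverse_via_adjugate(F_toy) @ F_toy` should reproduce the 3×3 identity matrix to floating-point precision.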



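With the inverse of the matrix in hand we can now use Eq. (11) to write the true auto-correlations in terms of the observed auto-correlations, 

$$ \omega^{T}_{ii} \det(F) = \omega^{o}_{ii} \left( \frac{N^{o}_{i}}{N^{T}_{i}} \right)^{2} \left( f_{jj}^{2} f_{kk}^{2} - f_{kj}^{2} f_{jk}^{2} \right) + \omega^{o}_{jj} \left( \frac{N^{o}_{j}}{N^{T}_{i}} \right)^{2} \left( f_{ki}^{2} f_{jk}^{2} - f_{ji}^{2} f_{kk}^{2} \right) + \omega^{o}_{kk} \left( \frac{N^{o}_{k}}{N^{T}_{i}} \right)^{2} \left( f_{ji}^{2} f_{kj}^{2} - f_{ki}^{2} f_{jj}^{2} \right), \tag{B3} $$

where $i \neq j \neq k$. The three true auto-correlation functions are found by permutations of the indices $(i,j,k) = (1,2,3)$, $(2,3,1)$ and $(3,1,2)$. Note that 
the equation is symmetric in the last two indices, yielding the same result for $(i,j,k)$ and $(i,k,j)$. Substituting these equations into Eq. (10) for 
the observed cross-correlations we find 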



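$$ \begin{aligned}
\omega^{o}_{ij} \det(F) ={}& \omega^{o}_{ii} \frac{N^{o}_{i}}{N^{o}_{j}} \left[ f_{ii} f_{ij} \left( f_{jj}^{2} f_{kk}^{2} - f_{kj}^{2} f_{jk}^{2} \right) + f_{ji} f_{jj} \left( f_{kj}^{2} f_{ik}^{2} - f_{ij}^{2} f_{kk}^{2} \right) + f_{ki} f_{kj} \left( f_{ij}^{2} f_{jk}^{2} - f_{jj}^{2} f_{ik}^{2} \right) \right] \\
&+ \omega^{o}_{jj} \frac{N^{o}_{j}}{N^{o}_{i}} \left[ f_{ii} f_{ij} \left( f_{ki}^{2} f_{jk}^{2} - f_{ji}^{2} f_{kk}^{2} \right) + f_{ji} f_{jj} \left( f_{ii}^{2} f_{kk}^{2} - f_{ki}^{2} f_{ik}^{2} \right) + f_{ki} f_{kj} \left( f_{ji}^{2} f_{ik}^{2} - f_{ii}^{2} f_{jk}^{2} \right) \right] \\
&+ \omega^{o}_{kk} \frac{(N^{o}_{k})^{2}}{N^{o}_{i} N^{o}_{j}} \left[ f_{ii} f_{ij} \left( f_{ji}^{2} f_{kj}^{2} - f_{ki}^{2} f_{jj}^{2} \right) + f_{ji} f_{jj} \left( f_{ki}^{2} f_{ij}^{2} - f_{ii}^{2} f_{kj}^{2} \right) + f_{ki} f_{kj} \left( f_{ii}^{2} f_{jj}^{2} - f_{ji}^{2} f_{ij}^{2} \right) \right]
\end{aligned} \tag{B4} $$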

Permuting the indices as above yields equations for the three observed cross-correlation functions. There are three equations and six 
unknowns; note that $f_{ii}$, $f_{jj}$ and $f_{kk}$ depend only on $f_{ij}$, $f_{ik}$, $f_{ji}$, $f_{jk}$, $f_{ki}$ and $f_{kj}$. By considering more than one scale we can double the number 
of equations, making the system constrained. 
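
The three-bin relation of Eq. (B4) can be checked numerically with a small Python sketch. Several ingredients are assumptions made for illustration and are not taken from this appendix: `f[m, i]` is read as the fraction of galaxies from true bin `m` that land in photometric bin `i`, the forward model mixes only the true auto-correlations (the true cross-correlations are set to zero), and all numbers are invented. The function names are hypothetical.

```python
import numpy as np

def observed_from_true(f, n_true, w_true):
    """Assumed forward model: mix true bins into photometric bins."""
    n_obs = f.T @ n_true                 # N_i^o = sum_m f_mi N_m^T
    x = n_true ** 2 * w_true             # (N_m^T)^2 w_mm^T
    # w_ij^o N_i^o N_j^o = sum_m f_mi f_mj (N_m^T)^2 w_mm^T
    w_obs = np.einsum("mi,mj,m->ij", f, f, x) / np.outer(n_obs, n_obs)
    return n_obs, w_obs

def cross_from_autos(f, n_obs, w_auto, i, j, k):
    """Observed cross-correlation w_ij^o from the observed autos (Eq. B4)."""
    g = f ** 2
    det = np.linalg.det(g.T)             # F[a, b] = f[b, a]**2 (assumed)
    b_i = (f[i, i] * f[i, j] * (g[j, j] * g[k, k] - g[k, j] * g[j, k])
           + f[j, i] * f[j, j] * (g[k, j] * g[i, k] - g[i, j] * g[k, k])
           + f[k, i] * f[k, j] * (g[i, j] * g[j, k] - g[j, j] * g[i, k]))
    b_j = (f[i, i] * f[i, j] * (g[k, i] * g[j, k] - g[j, i] * g[k, k])
           + f[j, i] * f[j, j] * (g[i, i] * g[k, k] - g[k, i] * g[i, k])
           + f[k, i] * f[k, j] * (g[j, i] * g[i, k] - g[i, i] * g[j, k]))
    b_k = (f[i, i] * f[i, j] * (g[j, i] * g[k, j] - g[k, i] * g[j, j])
           + f[j, i] * f[j, j] * (g[k, i] * g[i, j] - g[i, i] * g[k, j])
           + f[k, i] * f[k, j] * (g[i, i] * g[j, j] - g[j, i] * g[i, j]))
    total = (w_auto[i] * n_obs[i] / n_obs[j] * b_i
             + w_auto[j] * n_obs[j] / n_obs[i] * b_j
             + w_auto[k] * n_obs[k] ** 2 / (n_obs[i] * n_obs[j]) * b_k)
    return total / det

# Invented toy inputs; each row of f sums to one.
f_demo = np.array([[0.90, 0.07, 0.03],
                   [0.05, 0.85, 0.10],
                   [0.02, 0.08, 0.90]])
n_true_demo = np.array([1000.0, 1500.0, 800.0])
w_true_demo = np.array([0.05, 0.03, 0.02])   # true autos at a single scale
n_obs_demo, w_obs_demo = observed_from_true(f_demo, n_true_demo, w_true_demo)
```

For each cyclic permutation of the bin indices, `cross_from_autos(f_demo, n_obs_demo, np.diag(w_obs_demo), i, j, k)` should agree with `w_obs_demo[i, j]` from the forward model. In a fit, the six off-diagonal fractions would be the unknowns and this relation, evaluated at several scales, would supply the equations.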



